fix(backend): Synced ScheduledWorkflow CRs on apiserver startup #11469

Open

hbelmiro wants to merge 16 commits into master from issue-11296
Conversation

Contributor

@hbelmiro hbelmiro commented Dec 16, 2024

Resolves: #11296

This PR changes the apiserver to patch the existing ScheduledWorkflow CR of each Job on startup, so that it reflects the currently deployed KFP version.

Testing

  1. Create recurring runs

  2. Make sure the recurring runs are running

  3. Patch their swf CRs to force failures
    3.1 Get the swf CRs

    kubectl get swf
    NAME                             AGE
    heya8gqgg                        27m
    howdyxt5fj                       26m
    runofpipelinewithconditio5zwcj   60m

    3.2 Patch each swf with an invalid workflow spec to force failures. At this point, the recurring runs will start to fail due to the invalid spec.

    kubectl patch scheduledworkflow heya8gqgg --type='merge' -p '{"spec":{"workflow":{"spec":{"dynamicKey1":"dynamicValue1","dynamicKey2":"dynamicValue2"}}}}'
  4. Build a new apiserver image

  5. Edit the ml-pipeline deployment to use the new apiserver image

  6. The new apiserver pod will fix the swf CRs and the recurring runs will run successfully again

See a video demonstrating the test:

kfp-issue-11296.mp4

Checklist:


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@hbelmiro hbelmiro force-pushed the issue-11296 branch 2 times, most recently from 8ed1796 to 332cc47, on December 16, 2024 19:27
@google-oss-prow google-oss-prow bot added size/L and removed size/XXL labels Dec 16, 2024
@hbelmiro hbelmiro changed the title from "WIP - Update SWF CRs on apiserver startup" to "fix(backend): Synced ScheduledWorkflow CRs on apiserver startup" on Dec 17, 2024
@hbelmiro hbelmiro marked this pull request as ready for review December 17, 2024 20:26
@google-oss-prow google-oss-prow bot requested a review from HumairAK December 17, 2024 20:26
@hbelmiro
Contributor Author

@HumairAK @chensun can you guys PTAL?

@@ -106,6 +106,13 @@ func main() {
	}
	log.SetLevel(level)

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
	defer cancel()
	err = resourceManager.SyncSwfCrs(ctx)
Contributor

The persistence agent seems to already have a reconcile loop for scheduled workflows. If I'm reading the code right, on startup it'll reconcile everything and then handle creates, updates, and deletes.

Could the migration logic be added to the persistence agent instead?

Contributor Author

It probably could. But that doesn't sound like the persistence agent's responsibility. It doesn't sound like the API server's responsibility either, but I think it fits better there since we have a jobStore.

My original plan was to do it in https://github.com/kubeflow/pipelines/tree/master/backend/src/crd/controller/scheduledworkflow, but we would have to make HTTP calls to the API server. Then we decided to leave it in the API server.

It would be nice to hear others' opinions about that.

Contributor

@hbelmiro I agree that the persistence agent isn't the best fit. That controller you linked could be a good fit.

I think we should consider adding a KFP version column to the jobs table so that you can skip generating and updating scheduled workflows if the version matches.

If we were to pursue using the scheduled workflow controller, here is an idea (see the Go sketch after this list):

  • Have the scheduled workflow controller query the API server health endpoint at startup to get the tag_name value to see what version of KFP we are on. In the background, it could keep querying the API server to see if the version changed.
  • The scheduled workflow controller's reconcile loop checks a pipelines.kubeflow.org/version annotation; if the value of that annotation doesn't match tag_name, the workflow definition is updated and the annotation value is set to the current version.
  • When the API server creates a ScheduledWorkflow object, it sets the pipelines.kubeflow.org/version annotation.
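For illustration, a minimal Go sketch of the annotation check described in this list. The scheduledWorkflow type and the regenerate callback are stand-ins for KFP's real types, not the actual controller code.

    package main

    import "fmt"

    // versionAnnotation is the annotation key proposed above.
    const versionAnnotation = "pipelines.kubeflow.org/version"

    // scheduledWorkflow stands in for the real ScheduledWorkflow CR; only the
    // fields needed for this sketch are included.
    type scheduledWorkflow struct {
        Annotations  map[string]string
        WorkflowSpec string
    }

    // reconcile applies the proposed rule: if the CR's version annotation does not
    // match the KFP version reported by the API server (tag_name), regenerate the
    // workflow definition and stamp the annotation with the current version.
    func reconcile(swf *scheduledWorkflow, currentVersion string, regenerate func() string) {
        if swf.Annotations[versionAnnotation] == currentVersion {
            return // generated by the running KFP version; nothing to do
        }
        swf.WorkflowSpec = regenerate()
        if swf.Annotations == nil {
            swf.Annotations = map[string]string{}
        }
        swf.Annotations[versionAnnotation] = currentVersion
    }

    func main() {
        swf := &scheduledWorkflow{Annotations: map[string]string{versionAnnotation: "2.3.0"}}
        reconcile(swf, "2.4.0", func() string { return "regenerated-spec" })
        fmt.Println(swf.Annotations[versionAnnotation]) // 2.4.0
    }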

Contributor Author

I love this idea. My only concern is about making HTTP calls to the API server.
How about implementing it in a follow-up PR?

Contributor

Sure, no problem! What do you think of the other comment about adding a KFP version column to the jobs table so that you can skip generating and updating scheduled workflows if the version matches?

Contributor Author

@mprahl where do you see it?


@mprahl @hbelmiro beware that, in the past, we also had problems with scheduled runs after upgrading RHOAI:

If I recall correctly, the problem in https://issues.redhat.com/browse/RHOAIENG-10790 was that the kfp-launcher pod needed an additional parameter to be set, and the scheduled runs created before the upgrade didn't provide that parameter, so they failed to run. In that case it would have been helpful to rebuild all scheduled runs on KFP API server startup, regardless of the KFP API server version.

What about providing an environment variable to control the behavior?
ML_PIPELINE_SCHEDULEDRUNS_REBUILDPOLICY=[disable,always,...]
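A rough Go sketch of how the API server could read such a variable at startup. The variable name and the disable/always values come from the comment above; the parsing helper and the "always" default are assumptions for illustration only.

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    type rebuildPolicy string

    const (
        rebuildDisable rebuildPolicy = "disable" // never touch existing ScheduledWorkflow CRs
        rebuildAlways  rebuildPolicy = "always"  // rebuild every CR on every API server start
    )

    // policyFromEnv reads the proposed variable and falls back to "always", which is
    // roughly what this PR does on startup today.
    func policyFromEnv() rebuildPolicy {
        switch p := rebuildPolicy(strings.ToLower(os.Getenv("ML_PIPELINE_SCHEDULEDRUNS_REBUILDPOLICY"))); p {
        case rebuildDisable, rebuildAlways:
            return p
        default:
            return rebuildAlways
        }
    }

    func main() {
        fmt.Println(policyFromEnv())
    }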

Contributor Author

Alternatively, we could set the API Server image name and tag from the Downward API. That way, downstream projects would also benefit even when the KFP version doesn't change.

Contributor Author

@mprahl this discussion is still pending.

Since we're now checking whether the current swf is deeply equal to the new one (see the sketch after this comment), I think we could skip this verification.
We would still be fetching the current swf and generating the new one unnecessarily, but I think that's a fair price to avoid adding more complexity.

WDYT?

I'd like to hear @jgarciao's and @HumairAK's opinions on this too.

(I don't have a strong opinion on this, and I'm fine with implementing it if you think it's better.)
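For reference, a tiny Go sketch of the deep-equality short-circuit mentioned above; workflowSpec is a stand-in type, not the real CR spec.

    package main

    import (
        "fmt"
        "reflect"
    )

    // workflowSpec stands in for the workflow definition stored in a ScheduledWorkflow CR.
    type workflowSpec struct {
        Entrypoint string
        Templates  []string
    }

    // shouldPatch reports true only when the freshly generated spec differs from what
    // is already on the cluster, so unchanged CRs are left alone.
    func shouldPatch(current, regenerated workflowSpec) bool {
        return !reflect.DeepEqual(current, regenerated)
    }

    func main() {
        cur := workflowSpec{Entrypoint: "main", Templates: []string{"step-a"}}
        fmt.Println(shouldPatch(cur, cur)) // false: identical, skip the patch
        fmt.Println(shouldPatch(cur, workflowSpec{Entrypoint: "main", Templates: []string{"step-a", "step-b"}})) // true: drifted, patch it
    }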

Contributor

My opinion is that the current code is good enough for now and the version metadata isn't needed at the moment. The extra version data would be nice, but there aren't likely to be very many scheduled workflows, so the extra cost, which is incurred asynchronously, seems okay to me.

	defer cancel()
	err := resourceManager.ReconcileSwfCrs(ctx)
	if err != nil {
		log.Errorf("Could not reconcile the ScheduledWorkflow Kubernetes resources: %v", err)
Contributor

I agree with your approach of not exiting with an error code in this case, but I'm curious what others think. My concern is that, as a KFP admin, it'd be hard to know that this failed, but it also doesn't seem warranted to keep the API server from running if this can't succeed.

Contributor Author

If I were implementing this from scratch, I'd definitely exit with an error. But since the original behavior (the one this PR aims to fix) is to run the API server with outdated swfs, existing deployments may start to fail after an upgrade even when they would work with outdated swfs.
At the same time, silently ignoring the failure while the outdated swfs keep "working" may be even more dangerous than just exiting with an error. (While writing this, I'm actually leaning more toward exiting with an error.)

Yeah, let's see what others think.


If the API server is still running, users will be able to delete or clone the failing scheduled runs using the UI.

Contributor Author

That's a good point @jgarciao.

@hbelmiro
Contributor Author

This PR depends on:

I included those commits here until those PRs are merged.

cc @mprahl

hbelmiro and others added 14 commits January 2, 2025 14:42
@mprahl
Contributor

mprahl commented Jan 2, 2025

@hbelmiro is there any way to add test coverage for this?

@hbelmiro
Contributor Author

hbelmiro commented Jan 2, 2025

> @hbelmiro is there any way to add test coverage for this?

@mprahl I'm not sure it's worth it. I'd have to replicate this reproducer, which means recreating the API Server pod. That would probably affect other tests.
Also, if we implement the version/image check, it will be even harder to test.

@mprahl
Contributor

mprahl commented Jan 2, 2025

> @hbelmiro is there any way to add test coverage for this?
>
> @mprahl I'm not sure it's worth it. I'd have to replicate this reproducer, which means recreating the API Server pod. That would probably affect other tests. Also, if we implement the version/image check, it will be even harder to test.

I was thinking you could have a test that calls ReconcileSwfCrs with the fake client manager. See backend/src/apiserver/resource/resource_manager_test.go for examples of where this is used. You should be able to create an invalid ScheduledWorkflow in the fake store (in-memory SQLite database), call ReconcileSwfCrs, and verify it was updated.
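As a self-contained illustration of that test shape (using stand-in types rather than KFP's real fake client manager, whose helpers aren't verified here): create a CR with an invalid spec, run the reconcile, and assert it was fixed.

    package reconcile

    import (
        "reflect"
        "testing"
    )

    // swf and fakeStore stand in for the ScheduledWorkflow CR and the fake client
    // manager; they only illustrate the test shape, not KFP's real APIs.
    type swf struct{ Spec string }

    type fakeStore struct{ crs map[string]*swf }

    // reconcileSwfCrs regenerates every stored CR spec and patches the ones that
    // drifted, mirroring what ResourceManager.ReconcileSwfCrs is described to do.
    func reconcileSwfCrs(s *fakeStore, regenerate func(name string) string) {
        for name, cr := range s.crs {
            if want := regenerate(name); !reflect.DeepEqual(cr.Spec, want) {
                cr.Spec = want
            }
        }
    }

    func TestReconcileSwfCrs(t *testing.T) {
        // A CR whose workflow spec was corrupted, as in the manual reproducer above.
        store := &fakeStore{crs: map[string]*swf{"heya8gqgg": {Spec: "invalid-spec"}}}

        reconcileSwfCrs(store, func(string) string { return "valid-spec" })

        if got := store.crs["heya8gqgg"].Spec; got != "valid-spec" {
            t.Fatalf("expected the ScheduledWorkflow CR to be reconciled, got %q", got)
        }
    }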

@hbelmiro
Contributor Author

hbelmiro commented Jan 3, 2025

> @hbelmiro is there any way to add test coverage for this?
>
> @mprahl I'm not sure it's worth it. I'd have to replicate this reproducer, which means recreating the API Server pod. That would probably affect other tests. Also, if we implement the version/image check, it will be even harder to test.
>
> I was thinking you could have a test that calls ReconcileSwfCrs with the fake client manager. See backend/src/apiserver/resource/resource_manager_test.go for examples of where this is used. You should be able to create an invalid ScheduledWorkflow in the fake store (in-memory SQLite database), call ReconcileSwfCrs, and verify it was updated.

@mprahl actually, what needs to be updated is the CR. But that's a good idea anyway. I've added the test.

Contributor

@mprahl mprahl left a comment

/lgtm


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mprahl
Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Successfully merging this pull request may close these issues.

[backend] Updating KFP doesn't update existing scheduled runs
3 participants